Abstract. We compute an explicit formula for the expected value of the Colless
index of a phylogenetic tree generated under the Yule model, and an explicit formula 
for the expected value of the Sackin index of a phylogenetic tree generated under 
the uniform model. 



1 Introduction 

A phylogenetic tree is a representation of the shared evolutionary history of a set 
of extant species. From the mathematical point of view, it is a leaf-labeled rooted
tree, with its leaves representing the extant species under study, its internal nodes 
representing common ancestors of some of them, the root representing the most 
recent common ancestor of all of them, and the arcs representing direct descendance 
through mutations. In this paper we only consider binary phylogenetic trees, where 
every internal node has exactly two children. 

There are several stochastic models of the evolutionary processes that produce
phylogenetic trees. Two of the most popular are the Yule and the uniform models.
In the Yule, or Equal-Rate Markov, model [8, 23], starting with a single node, at
every step a leaf is chosen randomly and uniformly, and it is replaced by a cherry (a
phylogenetic tree consisting only of a root and two leaves). Finally, the labels
are assigned randomly and uniformly to the leaves once the desired number of
leaves is reached. Under this model of evolution, different trees with the same
number of leaves may have different probabilities. More specifically, if $T$ is a binary
phylogenetic tree with $n$ leaves, $V_{int}(T)$ is its set of internal nodes, and for every
internal node $z$ we denote by $k_T(z)$ the number of its descendant leaves, then the
probability of $T$ under the Yule model is [4, 21]

\[ P_Y(T) = \frac{2^{n-1}}{n!} \prod_{z \in V_{int}(T)} \frac{1}{k_T(z) - 1}. \]
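As an illustration, the following Python sketch evaluates the Yule probability of a small tree, assuming the standard product formula $P_Y(T) = \frac{2^{n-1}}{n!}\prod_{z}\frac{1}{k_T(z)-1}$, with the product taken over the internal nodes $z$ of $T$. The nested-tuple encoding of trees (a leaf is a label, an internal node is a pair of subtrees) and the function name are assumptions made only for this example.

```python
from fractions import Fraction
from math import factorial

def yule_probability(tree):
    """P_Y(T) = 2^(n-1)/n! * product over internal nodes z of 1/(k_T(z) - 1)."""
    def visit(node):
        # Returns (number of leaves below node, product of 1/(k - 1) over
        # the internal nodes in the subtree rooted at node).
        if not isinstance(node, tuple):
            return 1, Fraction(1)
        (k1, p1), (k2, p2) = visit(node[0]), visit(node[1])
        k = k1 + k2
        return k, p1 * p2 / (k - 1)

    n, product = visit(tree)
    return Fraction(2 ** (n - 1), factorial(n)) * product

caterpillar = (((1, 2), 3), 4)   # maximally unbalanced shape on 4 leaves
balanced = ((1, 2), (3, 4))      # fully balanced shape on 4 leaves
print(yule_probability(caterpillar))  # 1/18
print(yule_probability(balanced))     # 1/9
```

There are 12 labeled trees with the caterpillar shape and 3 with the balanced shape on 4 leaves, and indeed 12/18 + 3/9 = 1.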
On the other hand, the main feature of the uniform, or Proportional to
Distinguishable Arrangements, model [17] is that all phylogenetic trees with the same
number of leaves have the same probability. From the point of view of tree growth,
this corresponds to a process where, starting with a node labeled 1, at the k-th step
a new pendant arc, ending in the leaf labeled k + 1, is added either to a new root
or to some edge (with all possible locations of this new pendant arc equiprobable).
Notice that this is not an explicit model of evolution, only of tree growth.

The study of the probability distributions of indices associated to phylogenetic
trees under different stochastic models of phylogenetic tree growth has received a
lot of interest in the last decades [12, 19]. The ultimate goal of this line of research is
to be able to take as null model some stochastic model of phylogenetic tree growth
and evaluate against it the indices of a sample of phylogenetic trees reconstructed
from data. Two of the most popular indices used in this connection, measuring the
degree of symmetry, or balance, of a tree, are Sackin's [18] and Colless' [6] indices,
which we define later in the main body of this paper, but there are many other
measures associated to phylogenetic trees that have been used in this context, such
as other imbalance indices [Chap. 33] or the number of cherries of trees.


Several properties of the distributions of Sackin's $S$ and Colless' $C$ indices
have been studied in the literature under different models [3, 9, 13, 14, 15, 16, 21]. In
particular, their expected values have been studied under the Yule and the uniform
model. The results published so far on these expected values have been the
following. Let $S_n$ and $C_n$ be the random variables defined by choosing a binary
phylogenetic tree $T$ with $n$ leaves and computing $S(T)$ or $C(T)$, respectively. Then:

- Under the Yule model,

  • $E_Y(S_n) = 2n \sum_{j=2}^{n} 1/j$ [9].

  • $E_Y(C_n) = n\log(n) + (\gamma - 1 - \log(2))n + o(n)$ [3], where $\gamma$ is the Euler
    constant.

- Under the uniform model,



And, for instance, these are the formulas used by the R package apTreeshape [1]
to compute the expected value of these indices for a given number of leaves. Let us
also mention that Rogers [15] found a recursive formula for the moment-generating
functions of $C$ and $S$, which allowed him to compute $E_Y(C_n)$ and $E_U(C_n)$ for
$n = 1, \ldots, 50$, but he did not obtain any explicit formula for them.

In this paper we obtain explicit formulas for $E_Y(C_n)$ and $E_U(S_n)$. Namely,

\[ E_Y(C_n) = n \sum_{j=2}^{\lfloor n/2 \rfloor} \frac{1}{j} + \delta_{odd}(n), \]

where $\delta_{odd}(n) = 1$ if $n$ is odd, and $\delta_{odd}(n) = 0$ if $n$ is even, and

\[ E_U(S_n) = \frac{n}{2n-3}\, {}_3F_2\!\left(\begin{matrix} 2,\ 2,\ 2-n \\ 1,\ 4-2n \end{matrix};\, 2\right), \]

where ${}_3F_2$ is a hypergeometric function [2] that can be directly computed with
many software systems, like Mathematica or R. These formulas thus contribute to
our knowledge of the probability distributions of these indices, and yield precise
values which can be used in tests.

2 Preliminaries and notations 

In this paper, by a phylogenetic tree on a set S of taxa we mean a binary rooted 
tree with its leaves bijectively labeled in the set S. To simplify the language, we 
shall always identify a leaf of a phylogenetic tree with its label. We shall also use 
the term phylogenetic tree with n leaves to refer to a phylogenetic tree on the set 
$\{1, \ldots, n\}$. We shall denote by $L(T)$ the set of leaves of a phylogenetic tree $T$ and
by $V_{int}(T)$ its set of internal nodes.

Let $\mathcal{T}(S)$ be the set of isomorphism classes of phylogenetic trees on a set $S$ of
taxa, and set $\mathcal{T}_n = \mathcal{T}(\{1, \ldots, n\})$. It is well known [7, Ch. 3] that $|\mathcal{T}_1| = 1$ and, for
every $n \geq 2$,

\[ |\mathcal{T}_n| = (2n-3)!! = (2n-3)(2n-5) \cdots 3 \cdot 1. \]
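This count can be checked for small $n$ with the growth process described in the Introduction for the uniform model: every tree on $\{1, \ldots, n\}$ arises exactly once by attaching the leaf $n$ to an edge of a tree on $\{1, \ldots, n-1\}$ or above its root. The sketch below is only an illustration; the nested-tuple encoding of trees and all function names are assumptions of this example.

```python
def insert_leaf(tree, leaf):
    """Yield every tree obtained by hanging `leaf` on an edge of `tree` or above its root."""
    yield (tree, leaf)  # subdivide the arc entering this subtree (or add a new root)
    if isinstance(tree, tuple):
        left, right = tree
        for grown in insert_leaf(left, leaf):
            yield (grown, right)
        for grown in insert_leaf(right, leaf):
            yield (left, grown)

def all_trees(n):
    """List all phylogenetic trees on {1, ..., n}, grown one leaf at a time."""
    trees = [1]
    for leaf in range(2, n + 1):
        trees = [grown for tree in trees for grown in insert_leaf(tree, leaf)]
    return trees

def canonical(tree):
    """Order-free form of a tree, so that swapping two children changes nothing."""
    if not isinstance(tree, tuple):
        return tree
    return frozenset(canonical(child) for child in tree)

def double_factorial(m):
    """m!! for odd m >= -1, with the convention (-1)!! = 1."""
    result = 1
    for factor in range(m, 0, -2):
        result *= factor
    return result

for n in range(1, 7):
    trees = all_trees(n)
    distinct = {canonical(tree) for tree in trees}
    # number generated, number of distinct isomorphism classes, (2n-3)!!
    print(n, len(trees), len(distinct), double_factorial(2 * n - 3))
```

A tree with $k$ leaves has $2k - 1$ attachment places ($2k - 2$ arcs plus the position above the root), which is where the product $1 \cdot 3 \cdots (2n-3)$ comes from.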

Whenever there exists a path from $u$ to $v$ in a phylogenetic tree $T$, we shall
say that $v$ is a descendant of $u$. The cluster of a node $v$ in $T$ is the set $C_T(v)$ of its
descendant leaves, and we shall denote by $k_T(v)$ the cardinal $|C_T(v)|$, that is, the
number of descendant leaves of $v$. The depth $\delta_T(v)$ of a node $v$ in a phylogenetic
tree $T$ is the length (number of arcs) of the unique path from the root $r$ to $v$.

Given two phylogenetic trees $T_1, T_2$ on disjoint sets of taxa $S_1, S_2$, respectively,
we shall denote by $T_1\,\widehat{\ }\,T_2$ the tree on $S_1 \cup S_2$ obtained by connecting the roots of
$T_1$ and $T_2$ to a (new) common parent $r$. Every tree in $\mathcal{T}_n$ is obtained as $T_k\,\widehat{\ }\,T_{n-k}$,
for some subset $S_k \subset \{1, \ldots, n\}$ with $k$ elements (with $1 \leq k \leq n-1$), some tree
$T_k$ on $S_k$ and some tree $T_{n-k}$ on $S_k^c = \{1, \ldots, n\} \setminus S_k$. Actually, if we perform in
this order the choices necessary to produce a tree $T \in \mathcal{T}_n$ in this way, we obtain
every tree in $\mathcal{T}_n$ twice.

An ordered $m$-forest on a set $S$ is an ordered sequence of $m$ phylogenetic trees
$(T_1, T_2, \ldots, T_m)$, each $T_i$ on a set $S_i$ of taxa, such that these sets $S_i$ are pairwise
disjoint and their union is $S$. Let $\mathcal{T}_{m,n}$ be the set of isomorphism classes of ordered
$m$-forests on $\{1, \ldots, n\}$. It is known (see, for instance, [Lem. 1]) that for every
$n \geq m \geq 1$,

\[ |\mathcal{T}_{m,n}| = \frac{(2n-m-1)!\, m}{(n-m)!\, 2^{n-m}}. \]
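The closed form for the number of ordered $m$-forests can be compared, for small $m$ and $n$, with a direct count that chooses the label set and the tree of the first component and then recurses on the remaining labels. This is only a sanity-check sketch; the function names are assumptions of this example, and it relies on $|\mathcal{T}_k| = (2k-3)!!$.

```python
from math import comb, factorial

def num_trees(k):
    """(2k-3)!!, the number of phylogenetic trees on k labeled leaves (1 if k = 1)."""
    result = 1
    for factor in range(2 * k - 3, 0, -2):
        result *= factor
    return result

def forests_closed_form(m, n):
    """(2n-m-1)! * m / ((n-m)! * 2^(n-m)), valid for n >= m >= 1."""
    return factorial(2 * n - m - 1) * m // (factorial(n - m) * 2 ** (n - m))

def forests_by_counting(m, n):
    """Count ordered m-forests on n labels: pick the first tree, then recurse."""
    if m == 0:
        return 1 if n == 0 else 0
    total = 0
    # The first tree takes `size` labels; the other m - 1 trees need one label each.
    for size in range(1, n - m + 2):
        total += comb(n, size) * num_trees(size) * forests_by_counting(m - 1, n - size)
    return total

for n in range(1, 7):
    print(n, [(forests_closed_form(m, n), forests_by_counting(m, n)) for m in range(1, n + 1)])
```

For $m = 1$ the formula reduces to $(2n-2)!/((n-1)!\,2^{n-1}) = (2n-3)!!$, and for $m = n$ it gives $n!$, the number of orderings of $n$ single leaves.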

3 Expected value of the Colless index under the Yule model 

Let $T$ be a phylogenetic tree. For every $v \in V_{int}(T)$, the balance value of $v$ is
$bal_T(v) = |k_T(v_1) - k_T(v_2)|$, where $v_1$ and $v_2$ are its children. The Colless index
[6] of a phylogenetic tree $T \in \mathcal{T}_n$ is

\[ C(T) = \sum_{v \in V_{int}(T)} bal_T(v). \]
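The definition translates into a short recursion that returns, for each subtree, its number of leaves together with the sum of the balance values of its internal nodes. The nested-tuple encoding of trees and the function name are assumptions made for this sketch.

```python
def colless(tree):
    """Return (number of leaves, Colless index) of a nested-tuple binary tree."""
    if not isinstance(tree, tuple):
        return 1, 0  # a leaf has no internal nodes
    (k1, c1), (k2, c2) = colless(tree[0]), colless(tree[1])
    # The root of this subtree contributes |k_T(v1) - k_T(v2)|.
    return k1 + k2, c1 + c2 + abs(k1 - k2)

caterpillar = (((1, 2), 3), 4)
balanced = ((1, 2), (3, 4))
print(colless(caterpillar))  # (4, 3)
print(colless(balanced))     # (4, 0)

# Joining a 3-leaf tree and a 4-leaf tree under a new root adds |3 - 4| = 1.
joined = (((1, 2), 3), ((4, 5), (6, 7)))
print(colless(joined))       # (7, 2)
```

The last example already shows the additive behaviour under joining two trees at a new root: the index of the whole tree is the sum of the indices of the two parts plus the balance value of the root.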



Lemma 1. If $T_k \in \mathcal{T}_k$ and $T_{n-k} \in \mathcal{T}_{n-k}$, then

(a) $C(T_k\,\widehat{\ }\,T_{n-k}) = C(T_k) + C(T_{n-k}) + |n - 2k|$

(b) $P_Y(T_k\,\widehat{\ }\,T_{n-k}) = \dfrac{2}{(n-1)\binom{n}{k}}\, P_Y(T_k)\, P_Y(T_{n-k})$

where $P_Y$ denotes the probability of a phylogenetic tree under the Yule model.

Proof. Assertion (a) is well known, and a direct consequence of the definition of
$C$: the root $r$ of $T_k\,\widehat{\ }\,T_{n-k}$ has balance value $|k - (n-k)| = |n - 2k|$. Assertion (b) is a
direct consequence of the explicit probabilities of $T_k$, $T_{n-k}$ and $T_k\,\widehat{\ }\,T_{n-k}$ under the
Yule model, and the fact that
$V_{int}(T_k\,\widehat{\ }\,T_{n-k}) = V_{int}(T_k) \cup V_{int}(T_{n-k}) \cup \{r\}$, these unions being disjoint.



Lemma 2. Let $I : \bigcup_{n \in \mathbb{N}} \mathcal{T}_n \to \mathbb{R}$ be a mapping such that, for every pair of
phylogenetic trees $T_1, T_2$ on disjoint sets of taxa $S_1, S_2$, respectively,

\[ I(T_1\,\widehat{\ }\,T_2) = I(T_1) + I(T_2) + f(|S_1|, |S_2|) \]

for some mapping $f : \mathbb{N} \times \mathbb{N} \to \mathbb{R}$. For every $n \geq 1$, let $I_n$ be the random variable
that chooses a tree $T_n \in \mathcal{T}_n$ and computes $I(T_n)$, and let $E_Y(I_n)$ be its expected
value under the Yule model. Then, for every $n \geq 2$,

\[ E_Y(I_n) = \frac{1}{n-1}\Big( 2\sum_{k=1}^{n-1} E_Y(I_k) + \sum_{k=1}^{n-1} f(k, n-k) \Big). \]



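Proof. We compute $E_Y(I_n)$ using its very definition and Lemma 1(b):

\begin{align*}
E_Y(I_n) &= \sum_{T_n \in \mathcal{T}_n} I(T_n) \cdot P_Y(T_n) \\
&= \frac{1}{2} \sum_{k=1}^{n-1} \sum_{\substack{S_k \subset \{1,\ldots,n\} \\ |S_k| = k}} \sum_{T_k \in \mathcal{T}(S_k)} \sum_{T_{n-k} \in \mathcal{T}(S_k^c)} I(T_k\,\widehat{\ }\,T_{n-k}) \cdot P_Y(T_k\,\widehat{\ }\,T_{n-k}) \\
&= \frac{1}{2} \sum_{k=1}^{n-1} \binom{n}{k} \sum_{T_k \in \mathcal{T}_k} \sum_{T_{n-k} \in \mathcal{T}_{n-k}} \big(I(T_k) + I(T_{n-k}) + f(k, n-k)\big) \cdot \frac{2}{(n-1)\binom{n}{k}}\, P_Y(T_k) P_Y(T_{n-k}) \\
&= \frac{1}{n-1} \sum_{k=1}^{n-1} \sum_{T_k} \sum_{T_{n-k}} \big(I(T_k) + I(T_{n-k}) + f(k, n-k)\big) P_Y(T_k) P_Y(T_{n-k}) \\
&= \frac{1}{n-1} \sum_{k=1}^{n-1} \Big( \sum_{T_k} \sum_{T_{n-k}} I(T_k) P_Y(T_k) P_Y(T_{n-k}) + \sum_{T_k} \sum_{T_{n-k}} I(T_{n-k}) P_Y(T_k) P_Y(T_{n-k}) \\
&\qquad\qquad\qquad + \sum_{T_k} \sum_{T_{n-k}} f(k, n-k) P_Y(T_k) P_Y(T_{n-k}) \Big) \\
&= \frac{1}{n-1} \sum_{k=1}^{n-1} \Big( \sum_{T_k} I(T_k) P_Y(T_k) + \sum_{T_{n-k}} I(T_{n-k}) P_Y(T_{n-k}) + f(k, n-k) \Big) \\
&= \frac{1}{n-1} \sum_{k=1}^{n-1} \big( E_Y(I_k) + E_Y(I_{n-k}) + f(k, n-k) \big) \\
&= \frac{1}{n-1} \Big( 2 \sum_{k=1}^{n-1} E_Y(I_k) + \sum_{k=1}^{n-1} f(k, n-k) \Big)
\end{align*}

as we claimed.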
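The recurrence of Lemma 2 is straightforward to evaluate in exact arithmetic. The sketch below does so for a generic $f$ and, as a check, applies it to the Sackin index (for which the root of a tree with $n$ leaves contributes $n$, so $f(a, b) = a + b$) and compares the result with the formula $2n\sum_{j=2}^{n} 1/j$ quoted in the Introduction. The function names and the `value_on_single_leaf` parameter are assumptions of this example.

```python
from fractions import Fraction

def expected_under_yule(n_max, f, value_on_single_leaf=0):
    """Return [None, E_Y(I_1), ..., E_Y(I_n_max)] using the recurrence of Lemma 2."""
    expected = [None, Fraction(value_on_single_leaf)]
    for n in range(2, n_max + 1):
        total = 2 * sum(expected[1:n]) + sum(f(k, n - k) for k in range(1, n))
        expected.append(Fraction(total) / (n - 1))
    return expected

# Colless index: joining trees with a and b leaves adds |a - b| at the new root.
colless = expected_under_yule(8, lambda a, b: abs(a - b))
# Sackin index: joining trees with a and b leaves makes each of the a + b leaves one arc deeper.
sackin = expected_under_yule(8, lambda a, b: a + b)

def sackin_yule_known(n):
    """The known value 2n * sum_{j=2}^{n} 1/j of the expected Sackin index under the Yule model."""
    return 2 * n * sum(Fraction(1, j) for j in range(2, n + 1))

print([str(value) for value in colless[2:6]])  # ['0', '1', '2', '7/2']
print(all(sackin[n] == sackin_yule_known(n) for n in range(2, 9)))  # True
```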

Mappings $I$ satisfying the hypothesis in the previous lemma are a special case
of binary recursive tree shape statistics in the sense of [10].

Theorem 1. Let $C_n$ be the random variable that chooses a tree $T \in \mathcal{T}_n$ and
computes its Colless index $C(T)$. Its expected value under the Yule model is

\[ E_Y(C_n) = n \sum_{j=2}^{\lfloor n/2 \rfloor} \frac{1}{j} + \delta_{odd}(n), \]

where $\delta_{odd}(n) = 1$ if $n$ is odd, and $\delta_{odd}(n) = 0$ if $n$ is even.



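Proof. To simplify the notations, we shall denote $E_Y(C_n)$ simply by $E_n$. By
Lemmas 1(a) and 2,

\[ E_n = \frac{1}{n-1}\Big( 2\sum_{k=1}^{n-1} E_k + \sum_{k=1}^{n-1} |n - 2k| \Big). \]

Now a simple computation shows that

\[ \sum_{k=1}^{n-1} |n - 2k| = \begin{cases} \dfrac{n(n-2)}{2} & \text{if $n$ is even} \\[1ex] \dfrac{(n-1)^2}{2} & \text{if $n$ is odd} \end{cases} \]

and therefore

\[ E_n = \frac{2}{n-1} \sum_{k=1}^{n-1} E_k + \begin{cases} \dfrac{n(n-2)}{2(n-1)} & \text{if $n$ is even} \\[1ex] \dfrac{n-1}{2} & \text{if $n$ is odd} \end{cases} \]

In order to obtain a recurrence of order one from this expression, we distinguish
the case when $n$ is even from the case when $n$ is odd.

- When $n$ is even,

\[ E_n = \frac{2}{n-1} \sum_{k=1}^{n-1} E_k + \frac{n(n-2)}{2(n-1)}, \qquad E_{n-1} = \frac{2}{n-2} \sum_{k=1}^{n-2} E_k + \frac{n-2}{2}, \]

and then

\[ E_n = \frac{2}{n-1} E_{n-1} + \frac{n-2}{n-1}\Big( E_{n-1} - \frac{n-2}{2} \Big) + \frac{n(n-2)}{2(n-1)} = \frac{n}{n-1} E_{n-1} + \frac{n-2}{n-1}. \]

- When $n$ is odd,

\[ E_n = \frac{2}{n-1} \sum_{k=1}^{n-1} E_k + \frac{n-1}{2}, \qquad E_{n-1} = \frac{2}{n-2} \sum_{k=1}^{n-2} E_k + \frac{(n-1)(n-3)}{2(n-2)}, \]

and then

\[ E_n = \frac{2}{n-1} E_{n-1} + \frac{n-2}{n-1}\Big( E_{n-1} - \frac{(n-1)(n-3)}{2(n-2)} \Big) + \frac{n-1}{2} = \frac{n}{n-1} E_{n-1} + 1. \]

So, in summary,

\[ E_n = \begin{cases} \dfrac{n}{n-1} E_{n-1} + \dfrac{n-2}{n-1} & \text{if $n$ is even} \\[1ex] \dfrac{n}{n-1} E_{n-1} + 1 & \text{if $n$ is odd} \end{cases} \]

In particular, if $n$ is even,

\[ E_n = \frac{n}{n-1} E_{n-1} + \frac{n-2}{n-1} = \frac{n}{n-1}\Big( \frac{n-1}{n-2} E_{n-2} + 1 \Big) + \frac{n-2}{n-1} = \frac{n}{n-2} E_{n-2} + 2. \]

Setting $x_n = E_n/n$, this equation becomes

\[ x_n = x_{n-2} + \frac{2}{n}, \]

whose solution (for even numbered terms) with $x_2 = E_2/2 = 0$ is

\[ x_n = \sum_{j=2}^{n/2} \frac{1}{j}. \]

Therefore, when $n$ is even,

\[ E_n = n \sum_{j=2}^{n/2} \frac{1}{j}, \]

and when $n$ is odd,

\[ E_n = \frac{n}{n-1} E_{n-1} + 1 = \frac{n}{n-1} \cdot (n-1) \sum_{j=2}^{(n-1)/2} \frac{1}{j} + 1 = 1 + n \sum_{j=2}^{\lfloor n/2 \rfloor} \frac{1}{j}, \]

as we claimed.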
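The closed formula of Theorem 1 can be checked numerically against the full-history recurrence $E_n = \frac{1}{n-1}\big(2\sum_{k=1}^{n-1} E_k + \sum_{k=1}^{n-1}|n-2k|\big)$, with $E_1 = 0$, from which the proof starts. The following sketch does this in exact rational arithmetic; the function names are assumptions of this example, and the closed form is evaluated only for $n \geq 2$.

```python
from fractions import Fraction

def colless_yule_closed_form(n):
    """n * sum_{j=2}^{floor(n/2)} 1/j + delta_odd(n), intended for n >= 2."""
    return n * sum(Fraction(1, j) for j in range(2, n // 2 + 1)) + n % 2

def colless_yule_by_recurrence(n_max):
    """[None, E_1, ..., E_n_max] from the full-history recurrence, with E_1 = 0."""
    expected = [None, Fraction(0)]
    for n in range(2, n_max + 1):
        total = 2 * sum(expected[1:n]) + sum(abs(n - 2 * k) for k in range(1, n))
        expected.append(total / (n - 1))
    return expected

by_recurrence = colless_yule_by_recurrence(60)
for n in range(2, 9):
    print(n, colless_yule_closed_form(n), by_recurrence[n])
print(all(colless_yule_closed_form(n) == by_recurrence[n] for n in range(2, 61)))
```

For instance, for $n = 4$ both computations give 2, which agrees with a direct check: the 12 caterpillars have Colless index 3 and total probability 2/3, while the 3 balanced trees have index 0.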



4 Expected value of the Sackin index under the uniform model 



The Sackin index [18] of a phylogenetic tree $T \in \mathcal{T}_n$ is defined as the sum of the
depths of its leaves:

\[ S(T) = \sum_{i=1}^{n} \delta_T(i). \]

Alternatively,

\[ S(T) = \sum_{v \in V_{int}(T)} k_T(v). \]
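The two expressions agree because each leaf is counted once by every internal node above it, and the number of such nodes is its depth. A minimal sketch computing both, assuming a nested-tuple encoding of trees (a leaf is a label, an internal node is a pair of subtrees) and illustrative function names:

```python
def sackin_by_depths(tree, depth=0):
    """Sum of the depths of the leaves of a nested-tuple binary tree."""
    if not isinstance(tree, tuple):
        return depth
    return sum(sackin_by_depths(child, depth + 1) for child in tree)

def sackin_by_clusters(tree):
    """Return (number of leaves, sum of k_T(v) over the internal nodes v)."""
    if not isinstance(tree, tuple):
        return 1, 0
    (k1, s1), (k2, s2) = sackin_by_clusters(tree[0]), sackin_by_clusters(tree[1])
    # The root of this subtree is an internal node with k1 + k2 descendant leaves.
    return k1 + k2, s1 + s2 + k1 + k2

caterpillar = (((1, 2), 3), 4)
balanced = ((1, 2), (3, 4))
print(sackin_by_depths(caterpillar), sackin_by_clusters(caterpillar)[1])  # 9 9
print(sackin_by_depths(balanced), sackin_by_clusters(balanced)[1])        # 8 8
```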

Let $S_n$ be the random variable that chooses a tree $T \in \mathcal{T}_n$ and computes its
Sackin index $S(T)$. Since, under the uniform model, all trees in $\mathcal{T}_n$ have probability
$1/((2n-3)!!)$, the expected value of $S_n$ under the uniform model is

\[ E_U(S_n) = \frac{\sum_{T \in \mathcal{T}_n} S(T)}{(2n-3)!!}. \]

So, we need to compute the numerator in this fraction. Now, for every $k =
1, \ldots, n-1$, let

\[ c_{k,n} = |\{T \in \mathcal{T}_n \mid \delta_T(1) = k\}|. \]

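Lemma 3. For every $n \geq 3$,

\[ \sum_{T \in \mathcal{T}_n} S(T) = n \sum_{k=1}^{n-1} k \cdot c_{k,n}. \]

Proof. Notice that, for every $1 \leq i \leq n$,

\[ |\{T \in \mathcal{T}_n \mid \delta_T(i) = k\}| = |\{T \in \mathcal{T}_n \mid \delta_T(1) = k\}|. \]

Then

\begin{align*}
\sum_{T \in \mathcal{T}_n} S(T) &= \sum_{T \in \mathcal{T}_n} \sum_{i=1}^{n} \delta_T(i) = \sum_{i=1}^{n} \sum_{T \in \mathcal{T}_n} \delta_T(i) \\
&= \sum_{i=1}^{n} \sum_{k=1}^{n-1} k \cdot |\{T \in \mathcal{T}_n \mid \delta_T(i) = k\}| \\
&= \sum_{i=1}^{n} \sum_{k=1}^{n-1} k \cdot |\{T \in \mathcal{T}_n \mid \delta_T(1) = k\}| = n \sum_{k=1}^{n-1} k \cdot c_{k,n}.
\end{align*}
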
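Lemma 4. For every $n \geq 2$ and $k = 1, \ldots, n-1$,

\[ c_{k,n} = \frac{(2n-k-3)!\, k}{(n-k-1)!\, 2^{n-k-1}}. \]

Proof. To compute $c_{k,n}$, for $k \geq 1$, notice that every tree $T \in \mathcal{T}_n$ such that
$\delta_T(1) = k$ will have the form described in Fig. 1: the path from the root to the leaf 1
contains $k$ internal nodes, and the child of the $i$-th of them that does not lie on this
path is the root of a subtree $T_i$. Therefore, $T$ is determined by the ordered
$k$-forest $(T_1, T_2, \ldots, T_k)$ on $\{2, \ldots, n\}$, and thus

\[ c_{k,n} = |\mathcal{T}_{k,n-1}| = \frac{(2n-k-3)!\, k}{(n-k-1)!\, 2^{n-k-1}}. \]

Fig. 1. The structure of a tree $T$ with $\delta_T(1) = k$.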
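For small $n$, the values $c_{k,n}$ can be obtained by brute force: generate every tree on $\{1, \ldots, n\}$, record the depth of the leaf 1, and compare the tallies with the closed form of Lemma 4. The sketch below assumes a nested-tuple encoding of trees and generates them by attaching one leaf at a time to every arc or above the root; all names are illustrative only.

```python
from collections import Counter
from math import factorial

def insert_leaf(tree, leaf):
    """Yield every tree obtained by hanging `leaf` on an edge of `tree` or above its root."""
    yield (tree, leaf)
    if isinstance(tree, tuple):
        left, right = tree
        for grown in insert_leaf(left, leaf):
            yield (grown, right)
        for grown in insert_leaf(right, leaf):
            yield (left, grown)

def all_trees(n):
    """List all phylogenetic trees on {1, ..., n}."""
    trees = [1]
    for leaf in range(2, n + 1):
        trees = [grown for tree in trees for grown in insert_leaf(tree, leaf)]
    return trees

def depth_of_leaf(tree, leaf, depth=0):
    """Depth of `leaf` in `tree`, or None if `leaf` does not occur in it."""
    if not isinstance(tree, tuple):
        return depth if tree == leaf else None
    for child in tree:
        found = depth_of_leaf(child, leaf, depth + 1)
        if found is not None:
            return found
    return None

def c_closed_form(k, n):
    """(2n-k-3)! * k / ((n-k-1)! * 2^(n-k-1)), for 1 <= k <= n - 1."""
    return factorial(2 * n - k - 3) * k // (factorial(n - k - 1) * 2 ** (n - k - 1))

n = 6
counts = Counter(depth_of_leaf(tree, 1) for tree in all_trees(n))
for k in range(1, n):
    print(k, counts[k], c_closed_form(k, n))  # enumerated count, closed form
```

For $n = 5$ the closed form gives 15, 30, 36 and 24 for $k = 1, \ldots, 4$, which add up to $105 = 7!!$, the total number of trees.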



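Now, recall that the (generalized) hypergeometric function ${}_pF_q$ is defined [2] as

\[ {}_pF_q\!\left(\begin{matrix} a_1, \ldots, a_p \\ b_1, \ldots, b_q \end{matrix};\, z\right) = \sum_{k \geq 0} \frac{(a_1)_k \cdots (a_p)_k}{(b_1)_k \cdots (b_q)_k} \cdot \frac{z^k}{k!}, \]

where $(a)_k := a \cdot (a+1) \cdots (a+k-1)$. Many popular software systems, like
Mathematica or R, have implementations of these functions.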



Theorem 2. The expected value of the random variable $S_n$ under the uniform
model is

\[ E_U(S_n) = \frac{n}{2n-3}\, {}_3F_2\!\left(\begin{matrix} 2,\ 2,\ 2-n \\ 1,\ 4-2n \end{matrix};\, 2\right). \]


Proof. As we have already mentioned,

\[ E_U(S_n) = \frac{\sum_{T \in \mathcal{T}_n} S(T)}{(2n-3)!!} = \frac{n}{(2n-3)!!} \sum_{k=1}^{n-1} k \cdot c_{k,n} = \frac{n}{(2n-3)!!} \sum_{k=1}^{n-1} \frac{(2n-k-3)!\, k^2}{(n-k-1)!\, 2^{n-k-1}}. \]

Now

\[ (2n-3)!! = \frac{(2n-3)!}{2^{n-2}\,(n-2)!} \]

and thus

\begin{align*}
E_U(S_n) &= \frac{n}{2n-3} \sum_{k=1}^{n-1} \frac{k^2\, 2^{k-1}\, (n-2)(n-3) \cdots (n-k)}{(2n-4)(2n-5) \cdots (2n-k-2)} \\
&= \frac{n}{2n-3} \sum_{k=1}^{n-1} \frac{k^2\, 2^{k-1}\, (2-n)(2-n+1) \cdots (-n+k)}{(4-2n)(4-2n+1) \cdots (2-2n+k)} \\
&= \frac{n}{2n-3} \sum_{k=0}^{n-2} \frac{(k+1)^2\, 2^{k}\, (2-n)(2-n+1) \cdots (1-n+k)}{(4-2n)(4-2n+1) \cdots (3-2n+k)} \\
&= \frac{n}{2n-3} \sum_{k=0}^{n-2} \frac{((k+1)!)^2\, (2-n)(2-n+1) \cdots (1-n+k) \cdot 2^k}{(k!)^2\, (4-2n)(4-2n+1) \cdots (3-2n+k)} \\
&= \frac{n}{2n-3} \sum_{k=0}^{n-2} \frac{(2)_k (2)_k (2-n)_k}{(1)_k (4-2n)_k} \cdot \frac{2^k}{k!} = \frac{n}{2n-3}\, {}_3F_2\!\left(\begin{matrix} 2,\ 2,\ 2-n \\ 1,\ 4-2n \end{matrix};\, 2\right),
\end{align*}

where the series stops at $k = n-2$ because $(2-n)_k = 0$ for every $k \geq n-1$,
as we claimed.
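Since the hypergeometric series in Theorem 2 terminates, it can be summed exactly with rational arithmetic, without any special-function library. The sketch below does this and compares the result with the direct expression $\frac{n}{(2n-3)!!}\sum_{k=1}^{n-1} k \cdot c_{k,n}$ given by Lemmas 3 and 4. The function names are assumptions of this example.

```python
from fractions import Fraction
from math import factorial

def rising(a, k):
    """Rising factorial (a)_k = a (a+1) ... (a+k-1), with (a)_0 = 1."""
    result = 1
    for i in range(k):
        result *= a + i
    return result

def sackin_uniform_hypergeometric(n):
    """n/(2n-3) * 3F2(2, 2, 2-n; 1, 4-2n; 2), summing the terms k = 0, ..., n-2."""
    total = Fraction(0)
    for k in range(n - 1):
        numerator = rising(2, k) ** 2 * rising(2 - n, k) * 2 ** k
        denominator = rising(1, k) * rising(4 - 2 * n, k) * factorial(k)
        total += Fraction(numerator, denominator)
    return Fraction(n, 2 * n - 3) * total

def sackin_uniform_direct(n):
    """n * sum_k k * c_{k,n} / (2n-3)!!, with c_{k,n} as in Lemma 4."""
    def c(k):
        return factorial(2 * n - k - 3) * k // (factorial(n - k - 1) * 2 ** (n - k - 1))
    trees = 1
    for factor in range(2 * n - 3, 0, -2):
        trees *= factor
    return Fraction(n * sum(k * c(k) for k in range(1, n)), trees)

for n in range(2, 8):
    print(n, sackin_uniform_hypergeometric(n), sackin_uniform_direct(n))
```

For $n = 4$ both functions return 44/5, in agreement with a direct count: the 12 caterpillars have Sackin index 9 and the 3 balanced trees have Sackin index 8, so the mean is $(12 \cdot 9 + 3 \cdot 8)/15 = 8.8$.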



5 Conclusion 

In this paper we have obtained explicit formulas for the expected value of the Sackin
index under the uniform model and the Colless index under the Yule model. These
results add to the already known expected value of the Sackin index under the
Yule model [9]. For any $n$, these expected values are easily computed directly, using
for instance the software system R, and can be used instead of their estimations
in packages like apTreeshape [1] or SymmeTREE [5].

The problem of finding an explicit formula for the expected value of the Colless
index under the uniform model remains open.